arXiv:1506.02107v3 [stat.ML] 10 Oct 2015 


Data-Driven Learning of the Number of States in 
Multi-State Autoregressive Models 

Jie Ding, Mohammad Noshad, and Vahid Tarokh 
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 


Abstract —In this work, we consider the class of multi-state 
autoregressive processes that can be used to model non-stationary 
time series of interest. In order to capture different autoregressive 
(AR) states underlying an observed time series, it is crucial to 
select the appropriate number of states. We propose a new and 
intuitive model selection technique based on the Gap statistics, 
which uses a null reference distribution on the stable AR filters to 
identify whether adding a new AR state significantly improves 
the performance of the model. To that end, we define a new 
distance measure between two AR filters based on the mean 
squared prediction error, and propose an efficient method to 
generate stable filters that are uniformly distributed in the 
coefficient space. Numerical results are provided to evaluate the 
performance of the proposed approach. 

I. Introduction 

Modeling and forecasting time series is of fundamental 
importance in various applications. There may be occasional 
changes of behavior in a time series. Some examples are 
the changes in the stock market due to the financial crisis, 
or the variations of an EEG signal caused by the mode 
change in the brain. In the econometrics literature, this kind 
of time series is referred to as regime-switching model (TJ, 
El- In regime switching models, the time series = 

1,2,...} is assumed to have M states, and if x^ belongs to 
state to (to = 1,2,... ,M), the probability density function 
(pdf) of x conditioning on its past is in the form of 
fm{x ^..., a;W). The autoregressive (AR) model, 
one of the commonly used techniques to model stationary time 
series ED, is usually used to model each state. The autore¬ 
gression of state to is given by x^ + j^x^ = where 
£ ( n ) are independent and identically distributed (i.i.d.) noises 
with zero mean and variance cr^. Here x^ = [l,a;( n_1 ), 
■ 7 m = [ 7 m 0 , 7 mi, • • • , 7 mi] T is a real-valued 

vector of length L + 1 that characterizes state to. A more 
detailed survey on this model can be found in 0. We refer 
to this model as a multi-state AR model and to 7 m as the 
AR filter or AR coefficients of state to. The above model 
with 7 m = 1 was first analyzed by Lindgren IS and Baum 
et al. 0 . The model with general 7 m is widely studied 
in the speech recognition literature 0. The multi-state AR 
model is a general statistical model that can be used to fit 
data in many real world applications. It was shown that the 
model is capable of representing non-linear and non-stationary 
time series with multimodal conditional distributions and with 
heteroscedasticity 0 . There are two basic underlying assump¬ 
tions in this model: 1. Autoregression assumption, which is 
reasonable if the observations are obtained sequentially in 


time; 2. Multi-state assumption, which is reasonable if the 
stochastic process exhibits different behaviors in different time 
epochs. For example, stock prices may have dramatic while not 
permanent changes in the case of business cycles or financial 
crises, and those dynamics can be described by stochastic 
transitions among different states. 

Despite the wide applications of the multi-state AR model, 
there are few results on how to estimate the number of states 
M in a time series. Obviously, different values of M produce 
a nested family of models and models with larger M’s fit the 
observed data better. The drawback of using complex models 
with a large M is the over-fitting problem which decreases 
the predictive power of the model. Hence, a proper model 
selection procedure that identifies the appropriate number 
of states is vital. It is tempting to test the null hypothesis 
that there are M states against the alternative of M + 1. 
Unfortunately, the likelihood ratio test of this hypothesis fails 
to satisfy the usual regularity conditions since some parameters 
of the model are unidentified under the null hypothesis. An 
alternative is to apply Akaike information criterion (AIC) ED 
or Bayesian information criterion (BIC) 0 to introduce a 
penalty on the complexity of the model in the model selection 
procedure. However, in general AIC and BIC are shown to be 
inaccurate in estimating the number of states ma¬ 
in this paper, we propose a model selection criterion in¬ 
spired by the work of Tibshirani et al. DU who studied the 
clustering of i.i.d. points under Euclidean distance. The idea 
is to identify M by comparing the goodness of fit for the 
observed data with its expected value under a null reference 
distribution. To that end, we first draw a reference curve which 
plots the “goodness of fit” versus M based on the most non- 
informative distributed data, and describes how much adding 
new AR states improves the goodness of fit. We then draw 
a similar curve based on the observed data. In this work 
we choose the “goodness of fit” measure to be the mean 
squared prediction error (MSPE). Finally, the point at which 
the gap between the two curves is maximized is chosen as the 
estimated M. 

Besides the simplicity and effectiveness, another benefit of 
the proposed model selection criterion is that it is adaptive to 
the underlying characteristics of AR processes. The criterion 
for the processes of little dependency, i.e., the roots of whose 
characteristic polynomial are small, is different from the 
criterion for those of large dependency. In this sense, it takes 
into account the characteristics behind the observed data in an 
unsupervised manner, even though no domain knowledge or 


prior information is given. 

The remainder of the paper is outlined below. In Section ITT1 
we propose the Gap statistics for estimating the number of 
AR states in a time series. Section [III] formulates a specific 
class of the multi-state AR model, where the transitions 
between the states are assumed to be a first order Markov 
process. We emphasize that this parametric model is consid¬ 
ered primarily for simplicity and the proposed Gap statistics 
can be applied to general multi-state AR processes. A new 
initialization approach is also proposed that can effectively 
reduce the impact of a bad initialization on the performance 
of the expectation-maximization (EM) algorithm. Section m 
presents some numerical results to evaluate the performance of 
the proposed approach. Experiments show that the accuracy of 
the proposed approach in estimating the number of AR states 
surpasses those of AIC and BIC. 

II. Gap Statistics 

This section describes our proposed criterion for selecting 
the number of states in a multi AR process, inspired by HD- 
We draw a reference curve, which is the expected value of 
MSPE under a null reference distribution versus M, and use 
its difference with the MSPE of the observed data to identify 
the number of states, M. We show that computing each point 
of the reference curve turns out to be a clustering problem in 
the space of AR coefficients of a fixed size, where the distance 
measure for clustering is derived from the increase in MSPE 
when a wrong model is specified. We derive the distance 
measure in closed form, introduce an approach to generate 
stable AR filters that are uniformly distributed, and apply the 
fc-medoids algorithm to approximate the optimal solution for 
the clustering problem. We first outline our proposed model 
selection criterion in Subsection III-AI and then elaborate on 
the distance measure in Subsections IH-BI and the generation 
of random AR filters in Subsections III-CI 


A. The Model Selection Criterion 


We use superscript (n) to represent the data at time step 
n, and Nip, a 2 ) to denote the normal distribution with mean 
// and variance er 2 . Symbols in bold face represent vectors 
or matrices. We start from a simple scenario where the data 
{x( n \n = 1,2,...} is generated using a single stable AR 
filter ip a'- x^ = where = [1, 

• •• ,x ( ~ n ~ L ' ) ] T , ip A = [0ao,04i, ■ ■ ■ ,4>al] t , and are 
i.i.d. Af(0, a a)- Suppose we are at time step n — 1 and we 
want to predict the value at time n. If x^ = — ip’^x^ is 
used for prediction, the MSPE is E{(x^ + ip^x^) 2 } = 
E{(e^) 2 } = o\. But if another AR filter is used for 
prediction instead of ip a, i.e., x^ = — ipgX^ n \ the MSPE 
becomes E{( + i/j^x^) 2 }. The difference of the two 
MSPE is defined by 


D(lp A ,1p B ) = E \ X^+lpgX^ 




= E 


(0A - t/’s) T at (n) 


( 1 ) 


It is easy to observe that D(ip A ,ipB) is always nonnegative, 
which means that using the mismatch filter for prediction 
increases MSPE. We refer to D(ip A ,ipB) as the mismatch 
distance between two filters ip a and ips, though it is not a 
metric. When the data generated from ip a has zero mean, i.e., 
040 = 0, we let ip a also represents [ip A i, • • ■, 04 l] t of length 
L (with constant term omitted) with a slight abuse of notation, 
and we use ips in the same manner. 

As has been mentioned in Section [I] our model selection 
criterion is based on a reference curve that describes how much 
adding a new state increases the goodness of fit in the most 
non-informative or the “worst” case. To that end, we consider 
an M- state zero mean AR process where at each time step n, 
nature chooses random mismatch filters (with zero constants) 
for prediction. In such a worst scenario, the M filters that 
minimize the average mismatch distances to the random filters 
are naturally believed to be the true data generating filters, and 
that minimal value, which is the average MSPE, is plotted 
as the reference curve. This leads to the following clustering 
problem in the space of stable AR filters Rl{t) C R l , where 

L L 

R L (r) ={[Ai, ..., Al] t I z L + ^2 XiZ L - e = - af), 

e=i e=i 

A e € R, | at\ < r, 0 < r < 1,£ = 1,..., L}. 


Clustering of Stable Filters: For a fixed M, let ft = {ip 1 , 
02 , ■ • ■, 0f} be a set of uniformly generated stable filters 
of a given length L. We cluster f into M disjoint clusters 
Ci, C2 ,..., Cm, and define the within cluster sum of distances 
to be 


Wm = min 

Tl ,—)7M 


4e 


^ E 

m =1 


d ( 7m ,0n+i, ( 2 ) 


where D{ 7 m ,0) is defined in ([TJ and will be further sim¬ 
plified in ([4]), (0) and ©. By computing log (Wm) for M = 
1,..., M max , we obtain the reference curve. The optimization 
problem (Q} can be solved by the /c-medoids algorithm 1121 . 

The model selection criterion is outlined in Table Q] We 
note that the bound for the roots 0 < r < 1 is determined by 
the estimated filters, and thus the reference is data-dependent. 
Intuitively, if the process has less dependency, or in other 
words a point has less influence on its future points, the roots 
of the characteristic polynomials of each AR process are closer 
to zero and the MSPE curve will have smaller values. Thus, 
the filters from which the reference curve is calculated should 
also be drawn from a smaller bounded space. 


B. Distance Measure for Autoregressive Processes 

In this subsection, we provide the explicit formula for the 
distance in Equation i[T]i. Assume that the data is generated by 

L 

a stable filter ip a of length L. Let = Yh be the 

£=1 

characteristic polynomial of ip a, and let ai,..., cll denote the 












Algorithm 1 Model Selection Based on Gap Statistics 


Input: {x^ n \ n = 1,..., N}, M max (which is assumed to contain the “correct” number of states) 

Output: The number of AR states M opt . 

1: for M = 1 —» M max do 

2: Fit a multi-state AR model to the data (for instance using the EM algorithm described in Algorithm [4] ) 

3: Compute the MSPE Wm based on the estimated model. 

4: end for 

5: Plot log (Wm), M = 1,, M max , referred to as the “observed MSPE curve” 

6: Compute the largest absolute value of the roots of each estimated AR filter for the case M = M max , denoted by n,... Let 

r = min{max{ri,..., fM max }, 1}. 

7: for l = 1 —> Iter (number of iterations) do 

8: Run Algorithm [3] (to be introduced in Subsection 111-Cb to generate F (e.g. F = N) independent and uniformly distributed stable 

filters 3 = {ipi,... from Rl{t). 

9: for M = 1 —>- M max do 

10: Run Algorithm [2] to approximate the optimum of Q, and obtain \og{WMi), M = 1,..., M max . 

11: end for 

12: end for 

13: Let Wm = WMe/Iter. Plot log(l'PM), M = 1,..., M max as the reference curve (see Fig. [2] for an example). 

14: Choose M opt to be the smallest M (1 < M < M max ) that satisfies log [Wm) — log(W'Af) > log(VFM+i) — log(lpM+i) if there exists 
any; otherwise M opt = M max . 


Algorithm 2 Clustering Stable AR filters via “fc-medoids” Algorithm 


Input: A set of stable filters 3 = {ipi, ■ ■ ■ , iPf}, the number of desired clusters M, a number 0 < 5 < 1 (used for the stopping criterion). 
Output: The minimum within-cluster sum of distances (WCSD) we and {ip ci , ■ • •, ipc M } C 3 that approximate the M centers. 

1: Generate a matrix Dfxf whose elements are pairwise distances between biters: D uv = D(ip u ,‘ipv). 

2: Initialize M clusters characterized by centers c m and associated sets of indices I m (m = 1,..., M) that form a partition of {1,..., F}. 
3: Compute wi = %2 m=1 Eug i , ipu)- Let wo = 2wi/(l — S),l= 1 (for initialization purpose). 

4: while we-i — we > Swe-i do 
5: l = I + 1, we = we-i- 

6: for m = 1 —> M do 

7: Suppose that 7 m = {/ m [l],..., L[i m ]} and let k = 1. 

8: while k < im do 

9: Consider the candidates for the new centers, ci,... , cm, where c m ' = c m f ( m' = 1,..., M, m! ^ m) and c m = I m [k]. 

10: For each u == 1,..., F, let u € Im' if D(ip &m ,, ipu) < D(ipt } , ip u ) (j = 1,... ,M, j ^ m). 

11: Compute the WCSD given the new clusters: we = Em'=i E„gf , L , (' 1 Pc ,,ipu)- 

12: if we < we then 

13: k = 1, we = we. Cm = Im[k], I m ' = I m ' (m' = 1, ■ • ■, M). 

14: else 

15: k = k + 1. 

16: end if 

17: end while 

18: end for 

19: end while 


L 

roots of 1 + ^a{z), be., 1 + ^ a(z) = Jl (1 — ae/z), where 

e=i 

a ±,..., cll lie inside the unit circle {\ae\ < 1). Similarly define 
^b{z), bi,..., bi, for xpB- The value in 0} can be computed 
using the power spectral density and Cauchy’s integral theorem 
as: 


D (lp A ,1pB) 


Do (ip A ,ipB) + 


l L 

1 + E IpBl 

-- Ip AO + 4> BO 

1 + E V 1 At 

\ e=i 


( 3 ) 


where D 0 ( ipA,ipB) = 


a\ r |T A (e^)-'T B (e^) 
i-TT \l + ^ A (e ju )f 



k =1 


L 

I! («fe - be) 

t=i _ 

L 

ak II — a t) 

1= 1 


( L 

n(l- a k b e ) 

i=i 


n ctkcif) 

\t=l 


/ 


(4) 


for ak 0,afc ^ ae,k ^ I, where a* denotes the complex 
conjugate of a. For the degenerate cases when ak = 0 or 
a k = ae, D(tp A ,ipB ) reduces to lim ak -yo D(ip A ,ipB) or 



















lim a k ^a e D(xj) A , 1p B )- 

Remark 1. For now we assume that x 1 '’"' 1 at each state has zero 
mean by default, unless explicitly pointed out. We use -Do(’) 
in Identity 0 instead of D( ) in Identity 0 to compute the 
reference cur\’e. The derived reference curve can be applied 
to the general case. The reason is that it is more difficult to 
detect two AR states with the same mean than those that have 
different means. Therefore, the reference curves for the zero 
mean case (the “worst” case) can be used in general. 

The distance measure defined in Equation 0 is propor- 
tional to a\. We consider ct\ = a 2 which results in a 
constant logcr 2 in the computation of logWjyi in 0- Since 
it is the same for different M’s, we set <j 2 = 1 without loss of 
generality. 


The distance between two AR filters can be explicitly 
expressed in terms of the coefficients. This is computationally 
desirable if the filters are random samples generated in the 
coefficient domain, as will be discussed in Subsection Ill-CI 
Notations: Consider two polynomials of nonnegative 

powers p(z) and q{z ) respectively of degrees u > 0 and 
v > 0. Let q(z),pq{z) respectively denote the reciprocal 
polynomial of q{z), and the multiplication of p(z) and q(z), 
i.e., q(z) = z v q{z~ 1 ), pq{z) = p{z)q{z). Let Res(p{z), q(z)) 
be the resultant of p(z) and q(z). Define Po(p{z),q{z )) = 
Y^k= il( a k) anc * P°(p( z )i 0) = where ai,...,a u are the 
roots of p(z). 

Lemma 1. The values of Res(p(z), q{z)) and Po(p{z), q{z)) 
can be computed as polynomials of the coefficients of p(z) 
and q{z). 

The proof follows from the fact that the resultant of p(z) 
and q(z) is given by the determinant of their associated 
Sylvester matrix ED, and that for any n £ N, J2k -1 a k can 
be computed as polynomials in the coefficients of p{z) via 
Newton’s identities. We further provide the following result. 

Lemma 2. Let pa{z) = z L (l + ^a(z)) = II^iC 2 - 
ae),PB(z) = z L (l + ^siz)) = TlcLi( z - Pa( z ) = 
8(zpa{z)) /dz. The value of Dq{xPa,iPb) bi Equation 0 
(with a a = 1J can be computed in terms of the coeffi¬ 
cients of if a and tpB as in Equation 0 (on the top of 
the next page), where Ui = Po^pa{z),{p'wpa{z)) 1 ^, v % = 

Po(pa{z), (p , A (z)y' S j ,i = — 1, and the function 

is defined as S([si,..., s/j], t) = 

/ si-f 1 0 ••• 0 \ 

S 2 — t 2 si — t 2 '• : 

S det ; : o 

s h -i-t h ~ 2 : h -1 

\ S h -t h S h —i - f' 1-1 • • • S2 t 2 Sl-tJ 

for h > 0, Sf, •) = 1 for h = 0, and = 0 for h < 0, 

where det(-) denotes the determinant of a square matrix. 


Another simple way to compute the distance measure is 
given by the following lemma. 

Lemma 3. Let a = [fii, ■ ■ ■ , V’z,] 7 be the true filter of 
an autoregression with zero mean. The variance 70, the 
correlations pk = p~k {fi = 1 and the covariance 

matrix T of the autoregression are respectively defined to 
be 70 = E{(x^) 2 }, Pk = p- k = T = 

[70 pi-Define p=[pi,..., p L ] T , tf> k = 0 for k< 0 and 
k > L, S hl = 1 ifi= j and Stj = 0 otherwise (1 < i.j < L). 
Then p and 70 can be computed by 

p = -$~ 1 if>A, 7o = (1 + P T iPa)~ 1 , 

where & = is determined by = ipi+j + 

fi-j + Sij. The value of Dq (xjj a, fit b) in terms of ip a and 
ipB can be computed by 

D 0 {xpA, iPb) = {ipA ip B ) T T{xp A - IpB )■ ( 6 ) 


C. Generating Uniformly Distributed Filters with Bounded 
Roots 


As mentioned before. Gap statistics requires a reference 
curve that is calculated by clustering the filters randomly cho¬ 
sen from a reference distribution. In some scenarios we need 
to generate sample filters from Rl{c), where r is calculated 
from the observed data. Inspired by the work of Beadle and 
Djuric fl4l . we provide the following result on how to generate 
a random point in Rl{t) with uniform distribution 


Lemma 4. Generation of an independent uniform sample of 
\Ai,Li • • ■, A l,lY C Rl{c) can be achieved by the following 
procedure: 

1. Draw A17 uniformly on the interval [—r, r]; 

2. For k = 2,..., L, suppose that we have obtained 

[Ai,fc_i, ■ • ■, Afe_i ; fe_i] T that is uniformly distributed in 
Rk~i{r). Draw A k,k independently from a pdf proportional 
to the following function on the interval [—r k ,r k ] 


1 + 


Afc, 


LtJ 


1 - 


A k„ 


LVl 


(7) 


where 


A i k — Ai fc—1 T* 


Afc k ^k — i.k—1 


r,2k—2i 


(* = 1 .fc — 1 ). (8) 


Proof. We prove by induction. The pdf of Al .1 is pro¬ 
portional to one. For k > 1, suppose that the pdf of 

[Ai,fc-i, • ■ •, Afc_i i fc_i] T is proportional one inside I?fc_i(r) 
and zero elsewhere. Suppose that A k,k £ [— r k ,r k ] and 
Ai^,..., Xk-i,k are determined by 0. The Levinson-Durbin 
recursion in ® automatically enforces the stability con¬ 
straint that [Ai ; fc,..., Afc j fc] T falls inside Rk{r). The pdf of 
[Ai^,..., Afc i fc] T can be computed as 


p{Xi,k, • • •, Afc,fc) = p{Xk,k)p{Xi,k , ■ • ■, Xk-i,k | A fc,fc) 
— p(Afc,fc)p(Ai i fc_ 1 ,..., A* i _i i fc_i)| Jfc| 


oc p(Afc,fc) (l + A k,k/r k ) L ^ /2J (1 - A k,k/r k ) 


L(fc-i)/2J 







D 0 (lp A ,1pB) 


Po(pa{z),pbPb(z)^S([u 1 , ..., ml-i], 0) - Po(^pa{z),pbPbP , aPa{z)S([u 1 ,..., u L - 2 ],p' a Pa{z ))) 

Res(p A (z),p' A pA{z) S j 

Po(p A {z),p B {z)^S([v i,.. .,u L _i],0) - Po^ a (-),PbPa( 2: )‘ S '([' i; i>- • ■ i v l-i],p'a{ z ))) 

Res (p A (z),p’ A (z) S j 


( 5 ) 



Fig. 1: 10000 independent and uniformly distributed filters of 
L = 2 and the centers of two clusters, with r = 0.6, 0.8,1. 


1.4r 



Fig. 2: The reference curves for r = 0.6, 0.8,1, L = 4, which 
are obtained based on Iter = 32, P = 1000 (see Algorithm Q 3 . 


where J*, = det[d\i^/d\k-j,k-i]i<i,j<k-i is the Jacobian 
from Ato A fc _ iifc _i (i = 1,..., k - 1) taking A fcifc to be 
given. Therefore, if p{Ak,k) is proportional to the value given 
by 0, the joint pdf of Ai^,..., A k,k is proportional to one 
in Rk{r) and zero elsewhere. □ 

Remark 2. The technique presented in Lemma 0 can be 
equivalently formulated in a simple way summarized in the 
following lemma. The procedure is also described in Algo¬ 
rithm Q] 

Lemma 5. A sample of [Ai^ ..., A l,l) t that is uniformly dis¬ 
tributed in Rl{t) can be generated by the recursion Ao(z) = 
1, Ak(z) = zAk-i{z)+r k otkAk-i(z/r 2 ), where ak = 2(3k~l 
and j5k ~ Beta{\k/2 + lj, [(fc + 1)/2J), k = 1,..., L are 
independently generated. 

Fig. Q] illustrates the filters randomly generated from R 2 (r ) 
with r = 0.6,0.8,1. The centers of a two-clustering obtained 
using Algorithm U are also shown in this figure. These centers 
are calculated based on the average of 20 random instances, 
each with 1000 samples. Fig.[2]shows the reference curves for 
r = 0.6,0.8,1 and L = 4. 

III. Model 


model for illustration purpose, even though the model selection 
criterion proposed in Section |TT] is applicable to other multi¬ 
state AR models. 

A. Notations and Formulations 

Let S m denote the set of data points x (n) 

that are generated 

from state m. Suppose that x^~ L+1 \ ... , x^ are fixed and 
known. Let Z = 1 and Y = {y^}n=\ be a 

sequence of missing (unobserved) indicators, where z <n> is 
a M x M matrix, yOO is a M x 1 vector, and 

(n) f 1 if * (n_1) € S m and x {n) G S m >, 

mm | 0 otherwise, 

M _ f 1 lf x(n) e S ™, 

| 0 otherwise. 

Clearly, z00 = t/(" _1 )(y( n )) T . We note that y^ is a binary 
vector of length M containing a unique “1”; with a slight abuse 
of notation j/") is the location of that “1”. We assume that 
{y (n) }n^i is a Markov chain with transition probability matrix 
T, where P(x ( ' n '> G S' m ,x (n+1) € S m >) = T mm >, and yW is 
drawn from AJ(ai,..., a m ), where M. denotes the family of 
multinomial distributions. In other words, the assumed data 
generating process (given a fixed M) is: 


A popular way to describe the switching behavior between 
different states is to assume that the transition between the 
states follows a first-order Markov process. In this section, we 
adopt this assumption to formulate a parametric multi-state AR 


yin) 


\M{ax,...,a M ) if n = l, 

• • -,T y otherwise, 

N{- 7 yin)x( n \a 2 yM ), n = 2,...,N. 


(9) 

( 10 ) 

























Algorithm 3 Generating a uniform sample [Aj , A l,l\ t within (r) 

Input: L,r,Ao(z) = l. 

Output: A L (z) = z L + J2e=i A e,LZ L ~ e . 

1: for k = 1 — > L do 

2: Draw /3 k independently from the beta distribution /3 k ~ Beta([fe/2 + lj, |_(k + 1)/2J) 

3: Let a k = 2/3 k - 1 and A k (z) = zA k -i(z) + r k akAk^z/r 1 ). 

4: end for 


Let 0 = {7 m , a 2 m , Tmm' , m, m! = 1,..., M} be the set of 
unknown parameters to be estimated, where 7 ,,, is of length 
L + 1 (including the constant term). Though computing the 
maximum-likelihood estimation (MLE) of the above proba¬ 
bilistic model (fTot is not tractable, it can be approximated by a 
local maximum via the EM algorithm Ifl5i . The EM algorithm 
produces a sequence of estimates by the recursive application 
of E-step and M-step to the complete log-likelihood until a 
predefined convergence criterion is achieved. The complete 
log-likelihood can be written as 


N 

E log p{x {n) 

n =1 



N N 

= y y z {n) 

/ ^ / v mrr 

n= 1 m : m' — l 



f Tmm 1 'N 

V V^TTCTm’ J 



~l T m^ n) ) 2 \ 

2^ ) ' 


(ID 


For brevity, we provide the EM formulas below without 
derivation. In the E-step, we obtain a function of unknown 
parameters by taking the expectation of ( ITU with respect to 
the missing data Y and Z given the most updated parameters. 


N N / / T 

n= 1 m,m' — l x x 


r( n ) — ~TU 


7 m> xin) Y 




( 12 ) 


where 


„(«) 


/ = E(z%l, | 0 old ) = P(z/ (J1 - 1} = =m'\X) 

(13) 


can be computed recursively. We note that the parameters 
involved in the right-hand side of 4H take values from 
the last update. In the M-step, we use the coordinate ascent 
algorithm to obtain the following local maximum. The “old” 


superscriptions are omitted for brevity. 

N M 

E E ^* (n) (* (n) ) T 

n—1 m' — l 

( N M \ 

E E . 

n= 1 m' — l J 

N M , , 

E E ^m'm ( x{U) + 

2 n —1 m' — l 

CTm ^ N M 7T 

E E w£) m 

n=l m' — l 

N r \ 

W / 
mm' 

rri _ 71 — 1 

mm' — — ^ • 

E 

m' — l n—1 

B. Initialization of EM 

The convergence speed of the EM algorithm strongly de¬ 
pends on the initialization and an improper initialization can 
cause it to converge to a local maximum which is far away 
from the global optimum. A routine technique is to use 
multiple random initializations and choose the output with 
the largest likelihood 11161, but this can be significantly time 
consuming. Here, we use a new initialization technique to 
get a fast and reliable convergence for the EM algorithm. 
This technique is based on the fact that for time series 
obtained in most practical areas, the self-transition probability 
of the states is usually close to one, i.e., T mm ~ 1. By 
adopting this assumption, we propose the initialization method 
in Algorithm 0] which is shown empirically to produce more 
reliable and efficient EM results. We note that the “split” style 
rule that appears in line 5 of Algorithm [4] is used elsewhere 
(e.g. s El). 




IV. Numerical Experiments 

This section presents numerical results to evaluate the 
performance of the proposed technique. 

Fig. [3] shows a time series generated from a 3-state AR 
model with L = 4. The observed MSPE curve associated with 
the time series shown in Fig. 3 and the reference curve for 
L = 4 are plotted in Fig. Q] The gap between the two curves 
is maximized at M = 3. Thus, the selected M is 3. In order 
to compare the performance of the proposed technique with 
those of AIC and BIC, we have generated synthetic time series 
under three different scenarios and apply each technique on 
those data to estimate the number of states. The three scenarios 













Algorithm 4 EM algorithm with the proposed initialization approach 
Input: X = {x (n >};7 =1 . 

Output: The initial parameters 0 M = < f M = Sm = {a^}" =1 , T M = Uf M )mm ') \ , M = 1,..., M max . 

1 : lor M = I —> M max do 
2 : for n = 1 —> N — TVo + l^do 

3: Estimate the AR filter £„ and the noise variance a ^ from the sequence {a/ 71 ), ..., x ^ n+N °~ ] )} via the least squares method. 

4: end for 

5: Cluster £i,..., £jv-jv 0 +i into M cluster using fc-means algorithm and obtain the centers Qi,..., qm with the corresponding noise 

variances <fi, ■ ■ ■ ,?m- Pick up such Qk (1 < k < M) that maximize the sum of Euclidean distances to 71 ,... , 7 m- 1 . 

6 : if M > 1 then 

7: Let t M = U Em = Em-i U , ( f M )mm' = 1/M (m, m! = 1,..., M). 

8 : else 

9: fi = Q k , Ei = Cfc, T\ = 1. 

10 : end if 

11: Run EM updates described in CED-COD till certain stopping criterion is achieved. 

12 : end for 
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TABLE I: The estimated number of AR filters for three different scenarios using AIC, BIC and Gap statistics (with the true 
number of filters for each scenario highlighted) 
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Fig. 3: A random instance of multi-state AR time series: 

L = 4,M = 3, T m m = 0.98, fj, m = 0 ,a m = 1 ,T mrn > = 
0.01, m, m' = 1,..., 3, m ^ m'. 



Fig. 4: The reference curves and the observed MSPE curve 
for the time series shown in Fig. 0 The gap between the two 
curves is maximized at M = 3. 


are as follows: 

Scenario 1: ( L,M,r) = (4,3,1) ; 

Scenario 2: (L, M, r) = (1,4,0.8); 

Scenario 3: ( L,M,r) = (2, 2,0.6). 

For each scenario, 100 instances of multi-state AR time series 
of length N = 1000 are independently generated, each of 
which consists of M autoregressive filters which are uniformly 


drawn from the f?i(r) space. For each AR, the mean is 
uniformly generated from [—4,4] and the variance is assumed 
to be 1. The transition matrix is considered to be T mm = 
0.98 ,T mm i = 0.02/(M — 1) for m,m' = 1 ^ 

m'. For each instance, the model parameters for each fixed 
M = 1,... ,M max are estimated using EM algorithm, where 
M rnax = 6. Table 1 shows the estimated number of AR filters 




























































using AIC, BIC and Gap statistics, where two types of Gap 
statistics are used to estimate the number of states. In the first 
type, denoted by Gap (U), the reference curve is generated 
from sample AR filters that have roots inside the unit circle, 
and is therefore independent of the data. In the second form of 
the Gap statistics, represented by Gap (B), the sample filters 
are restricted to have roots inside a circle with radius r, where 
r is calculated from the data based on Algorithm|T] According 
to these results. Gap (B) outperforms AIC and BIC in all three 
scenarios, and it gives a better estimate of the number of states 
compared with Gap (U) since it is adaptive to the data. 

V. Conclusions 

In this paper we proposed a model selection technique to 
estimate the number of states in a time series. The proposed 
approach, referred to as the Gap statistics, uses a reference 
curve to check whether adding a new state significantly de¬ 
creases the prediction error. The reference curve is calculated 
by clustering uniformly generated stable AR filters. Numerical 
results show that the performance of the proposed model 
selection criterion surpasses those of AIC and BIC. 
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